Exploring metadata of the human vaginal microbiome

Group 6

Alberte Englund
Mathilde Due
Line Winther Gormsen
Sigrid Frandsen
Kristine Johansen

STUDY DESCRIPTION

Metadata from MGnify’s vaginal microbiome genome catalogue

  • Uncover patterns in genome quality, taxonomic composition, and ecological characteristics.

  • Identify potential patterns for diagnosis of endometriosis via associated pathogens of the vaginal microbiota:

    • Anaerococcus, Ureaplasma, Gardnerella, Veillonella, Corynebacterium, Peptoniphilus, Candida albicans, Alloscardovia 1

DATA CLEANING AND WRANGLING

Untidy –> tidy data

  1. Split the “Lineage” variable into seven variables, one per taxonomic rank (domain to species).
  2. Convert all “not provided” values to NA.
  3. Extract each taxonomic rank and remove its prefix (e.g. “d__”, “p__”).
  4. Convert empty strings to NA in the new taxonomy columns.
  5. Remove the GTDB suffixes (e.g. “_A”) to streamline the taxonomy.
  6. Remove columns that will not be used in the analysis.
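
The cleaning steps above can be sketched on toy data as follows. This is a minimal illustration, not the group's actual script: the toy rows, the helper `clean_rank()`, the choice of `Country` as a column containing "not provided", and the dropped column `Sample_accession` are all assumptions made for the example.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy rows in the same format as the Lineage column (values are illustrative)
toy <- tibble::tibble(
  Genome = c("G1", "G2"),
  Country = c("not provided", "Denmark"),
  Sample_accession = c("S1", "S2"),
  Lineage = c(
    "d__Bacteria;p__Bacillota_A;c__Clostridia;o__Tissierellales;f__Peptoniphilaceae;g__Peptoniphilus_A;s__",
    "d__Bacteria;p__Bacillota;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus;s__Lactobacillus crispatus"
  )
)

ranks <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")

# Hypothetical helper covering steps 3-5 for one taxonomy column
clean_rank <- function(x) {
  x <- str_remove(x, "^[a-z]__")        # step 3: drop rank prefix such as "p__"
  x <- str_remove_all(x, "_[A-Z]+\\b")  # step 5: drop GTDB suffix such as "_A"
  na_if(x, "")                          # step 4: empty string -> NA
}

clean <- toy |>
  mutate(across(where(is.character), ~ na_if(.x, "not provided"))) |>  # step 2
  separate(Lineage, into = ranks, sep = ";") |>                        # step 1
  mutate(across(all_of(ranks), clean_rank)) |>
  select(-Sample_accession)                                            # step 6

print(clean)
```

Doing the prefix removal before the empty-string check matters here: an unclassified species arrives as "s__" and only becomes an empty string once the prefix is gone.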
print(readr::read_tsv(here::here("data/_raw/genomes-all_metadata.tsv")))
# A tibble: 618 × 20
   Genome        Genome_type  Length N_contigs    N50 GC_content Completeness
   <chr>         <chr>         <dbl>     <dbl>  <dbl>      <dbl>        <dbl>
 1 MGYG000303700 MAG          678213         2 466332       47.8         63.7
 2 MGYG000303701 MAG         1500176        18 112881       42.4         87.8
 3 MGYG000303702 MAG         1210062        44  48790       26.4         94.8
 4 MGYG000303703 MAG         1706016        27  89653       44.6         93.7
 5 MGYG000303704 MAG          703182         7 111709       47.8         63.7
 6 MGYG000303705 MAG         2542045       112  34925       48           97.9
 7 MGYG000303706 MAG         1449687       185  10153       34.8         85.2
 8 MGYG000303707 MAG         1874692        90  28768       37.1         99.0
 9 MGYG000303708 MAG         1480380        12 169949       42.2         87.6
10 MGYG000303709 MAG          694644        57  15063       47.9         62.0
# ℹ 608 more rows
# ℹ 13 more variables: Contamination <dbl>, rRNA_5S <dbl>, rRNA_16S <dbl>,
#   rRNA_23S <dbl>, tRNAs <dbl>, Genome_accession <chr>, Species_rep <chr>,
#   Lineage <chr>, Sample_accession <chr>, Study_accession <chr>,
#   Country <chr>, Continent <chr>, FTP_download <chr>
untidy_data <- readr::read_tsv(
  here::here("data/_raw/genomes-all_metadata.tsv"))
print(
  untidy_data |>
    dplyr::select(Lineage))
# A tibble: 618 × 1
   Lineage                                                                      
   <chr>                                                                        
 1 d__Bacteria;p__Patescibacteria;c__Saccharimonadia;o__Saccharimonadales;f__Na…
 2 d__Bacteria;p__Bacillota_A;c__Clostridia;o__Saccharofermentanales;f__Fastidi…
 3 d__Bacteria;p__Bacillota;c__Bacilli;o__Staphylococcales;f__Gemellaceae;g__Ge…
 4 d__Bacteria;p__Bacillota_A;c__Clostridia;o__Saccharofermentanales;f__Fastidi…
 5 d__Bacteria;p__Patescibacteria;c__Saccharimonadia;o__Saccharimonadales;f__Na…
 6 d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidacea…
 7 d__Bacteria;p__Bacillota_A;c__Clostridia;o__Tissierellales;f__Peptoniphilace…
 8 d__Bacteria;p__Bacillota;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g…
 9 d__Bacteria;p__Bacillota_A;c__Clostridia;o__Saccharofermentanales;f__Fastidi…
10 d__Bacteria;p__Patescibacteria;c__Saccharimonadia;o__Saccharimonadales;f__Na…
# ℹ 608 more rows
print(readr::read_tsv(here::here("data/02_dat_clean.tsv")))
# A tibble: 618 × 21
   Genome        Genome_type  Length N_contigs    N50 GC_content Completeness
   <chr>         <chr>         <dbl>     <dbl>  <dbl>      <dbl>        <dbl>
 1 MGYG000303700 MAG          678213         2 466332       47.8         63.7
 2 MGYG000303701 MAG         1500176        18 112881       42.4         87.8
 3 MGYG000303702 MAG         1210062        44  48790       26.4         94.8
 4 MGYG000303703 MAG         1706016        27  89653       44.6         93.7
 5 MGYG000303704 MAG          703182         7 111709       47.8         63.7
 6 MGYG000303705 MAG         2542045       112  34925       48           97.9
 7 MGYG000303706 MAG         1449687       185  10153       34.8         85.2
 8 MGYG000303707 MAG         1874692        90  28768       37.1         99.0
 9 MGYG000303708 MAG         1480380        12 169949       42.2         87.6
10 MGYG000303709 MAG          694644        57  15063       47.9         62.0
# ℹ 608 more rows
# ℹ 14 more variables: Contamination <dbl>, rRNA_5S <dbl>, rRNA_16S <dbl>,
#   rRNA_23S <dbl>, tRNAs <dbl>, Country <chr>, Continent <chr>, Domain <chr>,
#   Phylum <chr>, Class <chr>, Order <chr>, Family <chr>, Genus <chr>,
#   Species <chr>
print(readr::read_tsv(here::here("data/03_dat_aug.tsv")))
# A tibble: 618 × 25
   Genome        Genome_type  Length N_contigs    N50 GC_content Completeness
   <chr>         <chr>         <dbl>     <dbl>  <dbl>      <dbl>        <dbl>
 1 MGYG000303700 MAG          678213         2 466332       47.8         63.7
 2 MGYG000303701 MAG         1500176        18 112881       42.4         87.8
 3 MGYG000303702 MAG         1210062        44  48790       26.4         94.8
 4 MGYG000303703 MAG         1706016        27  89653       44.6         93.7
 5 MGYG000303704 MAG          703182         7 111709       47.8         63.7
 6 MGYG000303705 MAG         2542045       112  34925       48           97.9
 7 MGYG000303706 MAG         1449687       185  10153       34.8         85.2
 8 MGYG000303707 MAG         1874692        90  28768       37.1         99.0
 9 MGYG000303708 MAG         1480380        12 169949       42.2         87.6
10 MGYG000303709 MAG          694644        57  15063       47.9         62.0
# ℹ 608 more rows
# ℹ 18 more variables: Contamination <dbl>, rRNA_5S <dbl>, rRNA_16S <dbl>,
#   rRNA_23S <dbl>, tRNAs <dbl>, Country <chr>, Continent <chr>, Domain <chr>,
#   Phylum <chr>, Class <chr>, Order <chr>, Family <chr>, Genus <chr>,
#   Species <chr>, Completeness_quality <chr>, Contamination_quality <chr>,
#   Overall_quality <chr>, endometriosis_associated <lgl>
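
The four extra columns in the augmented table could be derived along the following lines. The slides do not state the cut-offs or the matching rule, so everything here is an assumption: the thresholds loosely follow MIMAG-style limits (>90% completeness and <5% contamination for high quality), the way the two flags are combined into `Overall_quality` is invented for illustration, and endometriosis association is matched at genus level against the list from the study description.

```r
library(dplyr)

# Genera from the study-description slide (genus-level matching is an assumption)
endo_genera <- c("Anaerococcus", "Ureaplasma", "Gardnerella", "Veillonella",
                 "Corynebacterium", "Peptoniphilus", "Candida", "Alloscardovia")

# Toy rows (values are illustrative, not taken from the dataset)
toy <- tibble::tibble(
  Genome = c("G1", "G2", "G3"),
  Completeness = c(63.7, 94.8, 45.0),
  Contamination = c(0.5, 1.2, 12.0),
  Genus = c(NA, "Gardnerella", "Lactobacillus")
)

aug <- toy |>
  mutate(
    # Assumed cut-offs: >90 high, >=50 medium, otherwise low
    Completeness_quality = case_when(
      Completeness > 90 ~ "high",
      Completeness >= 50 ~ "medium",
      TRUE ~ "low"
    ),
    # Assumed cut-offs: <5 high, <10 medium, otherwise low
    Contamination_quality = case_when(
      Contamination < 5 ~ "high",
      Contamination < 10 ~ "medium",
      TRUE ~ "low"
    ),
    # Assumed rule: high only if both are high, low if either is low
    Overall_quality = case_when(
      Completeness_quality == "high" & Contamination_quality == "high" ~ "high",
      Completeness_quality == "low" | Contamination_quality == "low" ~ "low",
      TRUE ~ "medium"
    ),
    # %in% returns FALSE for NA genera, so the flag is always TRUE or FALSE
    endometriosis_associated = Genus %in% endo_genera
  )

print(aug)
```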

DATA DESCRIPTION

  • 618 vaginal metagenome-assembled genomes (MAGs)
  • 25 variables covering taxonomy, assembly quality, and geography
  • High completeness and low contamination for most genomes
  • Dataset dominated by a few major bacterial phyla
  • Genome lengths fall within biologically expected ranges

Most MAGs belong to only a few dominant phyla.
This indicates strong taxonomic skew in the dataset.


Most genomes have high completeness (>90%),
indicating generally strong assembly quality.


Genome lengths fall within the expected biological range
for vaginal bacterial taxa (typically 1.5–3 Mb).

ANALYSIS 1

ANALYSIS 2

ANALYSIS 3 - Endometriosis-associated and non-associated MAGs

  • Compared endometriosis-associated vs non-associated MAGs
  • Focused on GC content, genome length, completeness & contamination
  • Investigated whether associated MAGs cluster taxonomically
  • Goal: determine if associated MAGs form a genomically distinct group

Endometriosis-associated MAGs occur in only a few phyla.
Most phyla contain no associated MAGs, so the associated MAGs are taxonomically restricted to a few phyla.


GC content ranges overlap almost completely.
No evidence that GC% distinguishes associated vs non-associated MAGs.
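
A comparison like the one above could be sketched as a per-group summary, optionally followed by a rank-based test. The toy values are illustrative, and the Wilcoxon rank-sum test is a suggestion rather than something the slides report.

```r
library(dplyr)

# Toy data: GC content for associated and non-associated MAGs (illustrative)
toy <- tibble::tibble(
  endometriosis_associated = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  GC_content = c(42, 44, 40, 46, 43)
)

# Range and median of GC content per group; overlapping ranges show up directly
group_summary <- toy |>
  group_by(endometriosis_associated) |>
  summarise(
    n = n(),
    min_GC = min(GC_content),
    median_GC = median(GC_content),
    max_GC = max(GC_content),
    .groups = "drop"
  )
print(group_summary)

# Optional: non-parametric test of a shift in GC content between the groups
gc_test <- wilcox.test(GC_content ~ endometriosis_associated, data = toy)
print(gc_test$p.value)
```

The same pattern applies to genome length, completeness, and contamination by swapping the summarised column.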

ANALYSIS 4 - Species Distribution between countries

  • Investigated the distribution of lineage groups across countries
  • Counted group instances per country and reshaped the counts to wide format
  • Removed rows where Country is NA
  • Large differences in sample size between countries –> counts normalized
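
The counting, normalizing, and reshaping steps above might look like this. The toy data are illustrative, and normalizing to within-country proportions is an assumption, since the slides do not say which normalization was used.

```r
library(dplyr)
library(tidyr)

# Toy data: one row per MAG with its country and taxonomic order (illustrative)
toy <- tibble::tibble(
  Country = c("A", "A", "A", "B", "B", NA),
  Order = c("Lactobacillales", "Lactobacillales", "Bacteroidales",
            "Bacteroidales", "Lactobacillales", "Bacteroidales")
)

wide <- toy |>
  filter(!is.na(Country)) |>          # drop MAGs without a country
  count(Country, Order) |>            # group instances per country
  group_by(Country) |>
  mutate(prop = n / sum(n)) |>        # assumed normalization: share within country
  ungroup() |>
  select(-n) |>
  pivot_wider(names_from = Order, values_from = prop, values_fill = 0)

print(wide)
```

Working with proportions keeps a country with many MAGs from dominating a heatmap or PCA simply because of its sample size.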

Some variation between countries, e.g. in order Bacteroidales, but not much.

The differences could have been tested for significance.

Only two principal components are needed to explain 100% of the variance.

Clear separation of countries.

PC1 is associated with Fusobacteria and Bacteroidota (order Bacteroidales), consistent with the heatmap.
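
A PCA on a country-by-taxon table can be sketched with base R's `prcomp()`. The toy proportions are illustrative. After centring, a table with n rows has at most n − 1 components with non-zero variance, so two components reaching 100% is what one would expect if three countries remained after filtering. That is an inference, since the slides do not state the number of countries.

```r
# Toy wide table: per-country proportions of four orders (illustrative values)
props <- data.frame(
  Bacteroidales   = c(0.10, 0.30, 0.05),
  Lactobacillales = c(0.60, 0.40, 0.55),
  Fusobacteriales = c(0.20, 0.25, 0.30),
  Tissierellales  = c(0.10, 0.05, 0.10),
  row.names = c("Country_A", "Country_B", "Country_C")
)

# Centre the columns; proportions share a scale, so no extra scaling here
pca <- prcomp(props, center = TRUE, scale. = FALSE)

# Share of variance per component; with three rows, PC1 + PC2 cover all of it
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Loadings on PC1 show which taxa contribute most to that axis
print(pca$rotation[, "PC1"])
```

Comparing the PC1 loadings with the heatmap is one way to check that both views point to the same taxa.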

DISCUSSION

  • High-quality MAGs with good completeness
  • No strong genomic differences between groups
  • Limited metadata and uneven sampling

FUTURE PERSPECTIVES

  • Improve clinical + geographic metadata

CONCLUSION